Risk-Averse Multi-Armed Bandit Problems Under Mean-Variance Measure
Similar resources
Algorithms for multi-armed bandit problems
The stochastic multi-armed bandit problem is an important model for studying the exploration-exploitation tradeoff in reinforcement learning. Although many algorithms for the problem are well-understood theoretically, empirical confirmation of their effectiveness is generally scarce. This paper presents a thorough empirical study of the most popular multi-armed bandit algorithms. Three important...
Pure Exploration for Multi-Armed Bandit Problems
We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of forecasters that perform an on-line exploration of the arms. These forecasters are assessed in terms of their simple regret, a regret notion that captures the fact that exploration is only constrained by the number of available rounds (not necessarily known in advance), in contrast...
Multi-armed Bandit Problems with Strategic Arms
We study a strategic version of the multi-armed bandit problem, where each arm is an individual strategic agent and we, the principal, pull one arm each round. When pulled, the arm receives some private reward v_a and can choose an amount x_a to pass on to the principal (keeping v_a − x_a for itself). All non-pulled arms get reward 0. Each strategic arm tries to maximize its own utility over the cour...
Multi-armed Bandit Problems with History
In a multi-armed bandit problem, at each time step, an algorithm chooses one of the possible arms and observes its rewards. The goal is to maximize the sum of rewards over all time steps (or to minimize the regret). In the conventional formulation of the problem, the algorithm has no prior knowledge about the arms. Many applications, however, provide some data about the arms even before the alg...
Robust Risk-Averse Stochastic Multi-armed Bandits
We study a variant of the standard stochastic multi-armed bandit problem in which one is not interested in the arm with the best mean, but instead in the arm maximising some coherent risk measure criterion. Further, we study the deviations of the regret instead of the less informative expected regret. We provide an algorithm, called RA-UCB, to solve this problem, together with a high probabil...
Journal
Journal title: IEEE Journal of Selected Topics in Signal Processing
Year: 2016
ISSN: 1932-4553,1941-0484
DOI: 10.1109/jstsp.2016.2592622